The following tutorial shows the basics of setting up and interacting with an ArchR Project using a gold-standard dataset of hematopoietic cells ( CITATION ). This tutorial and all of the accompanying vignettes assume that you are running ArchR locally. Once all of these steps work for you, feel free to set up ArchR to work in a cluster environment. This tutorial does not explain every detail of every step. Please see the Vignettes section for more details on each major analytical step and all of the major features of ArchR.

What is an ArrowFile / ArchRProject?

The base unit of an analytical project in ArchR is called an ArrowFile. Each ArrowFile, stores all of the data associated with an individual sample. Here, a sample would be the most detailed unit of analysis desired (for ex. a single replicate of a particular condition). During creation and as additional analyses are performed, ArchR updates and edits each ArrowFile to contain additional layers of information. Then, an ArchRProject allows you to associate these ArrowFiles together into a single analytical framework.

Certain actions can be taken directly on ArrowFiles while other actions are taken on an ArchRProject which in turn updates each associated ArrowFile. Because ArrowFiles are stored as large HDF5-format files, “get-er” functions in ArchR retrieve data by interacting with the ArchRProject.

Getting Set Up

The first thing we do is set up our working directory, load our genome annotations, and set the number of threads we would like to use. Depending on the configuration of your local environment, you may need to modify the number of threads used below.

#Load R Libraries
library(ArchR)

#Set/Create Working Directory to Folder for Analysis
wd <- ""
dir.create(wd, showWarnings = FALSE, recursive = TRUE)
setwd(wd)

#Load Genome Annotations. Available annotations are for Hg19, Hg38, Mm9, or Mm10. 
#(Note if you want to build a custom annotatino see createGeneAnnotation or createGenomeAnnotation).
data("geneAnnoHg19")
data("genomeAnnoHg19")
geneAnno <- geneAnnoHg19
genomeAnno <- genomeAnnoHg19

#Set Default Threads for ArchR Functions
#By default ArchR uses the total number of cores / 2.
addArchRThreads()

Creating Arrow Files

For this tutorial, we will download a collection of fragment files. Fragment files are one of the base file types of the 10x Genomics analytical platform and can be easily created from any bam file. See the ArchR input types vignette for information on making your own fragment files. Once we have our fragment files, we provide their names as a vector to createArrowFiles. During creation, some basic matrices and data is added to each ArrowFile including a TileMatrix containing insertion counts across genome-wide 500-bp bins.

#Get Tutorial Data ~2.2GB To Download (if downloaded already ArchR will bypass downloading)
inputFiles <- getTutorialData("Hematopoiesis")

#Create Arrow Files (~10-15 minutes)
#It is important
ArrowFiles <- createArrowFiles(
  inputFiles = inputFiles,
  sampleNames = names(inputFiles),
  geneAnno = geneAnno,
  genomeAnno = genomeAnno,
  threads = threads,
  force = FALSE
)

Tidying up our data and creating an ArchRProject

One major source of trouble in single-cell data is the contribution of “doublets” to the analysis. A doublet refers to a single droplet that received a single barcoded bead and more than one nucleus. This causes the reads from more than one cell to appear as a single cell. We remove these computationally and describe this doublet removal process in more depth in the doublet removal vignette.

#Add Infered Doublet Scores to each Arrow File (~5-10 minutes)
doubScores <- addDoubletScores(ArrowFiles, threads = threads)

#Create ArchRProject
proj <- ArchRProject(
  ArrowFiles = ArrowFiles, 
  geneAnnotation = geneAnno,
  genomeAnnotation = genomeAnno,
  outputDirectory = "Heme_Tutorial"
)

#Filter Doublets
proj <- filterDoublets(proj)

Dimensionality Reduction

At this point, we have an ArchR project that is ready to be used in downstream visualizations and analyses. The first thing we will do is use an iterative latent semantic indexing (LSI) approach to define clusters in our data. Once we have identified clusters in our data, we can plot a UMAP embedding. For more details, see the dimensionality reduction vignette.

#Reduce Dimensions with Iterative LSI (~5-10 minutes)
proj <- addIterativeLSI(
  ArchRProj = proj, 
  useMatrix = "TileMatrix", 
  reducedDimsOut = "IterativeLSI",
  threads = threads
)

#Identify Clusters from Iterative LSI
proj <- addClusters(input = proj, reducedDims = "IterativeLSI", resolution = 1.2)

#Add Imputation Weights for Visualization
proj <- addImputeWeights(ArchRProj = proj)

#Compute a UMAP embedding to visualize our tiled matrix
proj <- addEmbedding(
  ArchRProj = proj, 
  reducedDims = "IterativeLSI", 
  embedding = "UMAP", 
  embeddingParams = list(min_dist = 0.4),
  force = TRUE
)

#Plot the UMAP Embedding with Metadata Overlayed
plotList <- list()
plotList[[1]] <- plotEmbedding(ArchRProj = proj, colorBy = "colData", name = "Sample")
plotList[[2]] <- plotEmbedding(ArchRProj = proj, colorBy = "colData", name = "Clusters", plotParams = list(labelMeans=TRUE))
plotPDF(plotList = plotList, name = "UMAP-Samples-Clusters", width = 6, height = 6, ArchRProj = proj)
Using our tutorial data, your UMAP plots should look like this. (Note if you see a blank space below try firefox or safari)

Identifying Cluster Cell Types Using Marker Genes

In order to understand which clusters correspond to which cell types, we use a supervised approach based on prior knowledge of the genes that are active in specific cell types. We determine gene activity scores for each putative marker gene based on chromatin accessibility signal in the region surrounding the gene’s promoter. We can then overlay these gene activity scores on our UMAP embedding to visualize the relationship between gene activity and cluster. For more details, see the marker genes vignette.

markerGenes  <- c(
    "CD34", #Early Progenitor
    "GATA1", #Erythroid
    "PAX5", "MS4A1", #B-Cell Trajectory
    "CD14", #Monocytes
    "CD3D", "CD8A", "TBX21", "IL7R" #TCells
  )

#Plot the UMAP Embedding with Marker Genes Overlayed
plotList <- list()
plotList[[1]] <- plotEmbedding(ArchRProj = proj, colorBy = "GeneScoreMatrix", name = markerGenes)
plotPDF(plotList = plotList, name = "UMAP-Marker-Gene-Scores", width = 6, height = 6, ArchRProj = proj)

#Plot Tracks at Marker Genes
plotTracks <- ArchRRegionTrack(ArchRProj = proj, threads = threads, geneSymbol = markerGenes)
plotPDF(plotList = plotTracks, name = "Tracks-Marker-Genes", width = 6, height = 8, ArchRProj = proj)

#Identify Marker Gene through Pairwise Test vs Null
markersGS <- markerFeatures(ArchRProj = proj, threads = threads, useMatrix = "GeneScoreMatrix")
heatmapGS <- markerHeatmap(
  seMarker = markersGS, 
  cutOff = "FDR <= 0.01 & Log2FC >= 1", 
  labelMarkers = markerGenes
)
plotPDF(heatmapGS, name = "GS-Marker-Heatmap", width = 8, height = 12, ArchRProj = proj)
Using our tutorial data, your gene activity score UMAP plots should look like this. (Note if you see a blank space below try firefox or safari)

Creating a Reproducible Peak Set

One of the most complicated aspects about ATAC-seq and scATAC-seq analysis is the generation of a reproducible and robust peak set. In ArchR, we use an iterative overlap removal process that we first described in Corces* & Granja* et al. Science 2018. This process is described in detail in the peak calling vignette.

To robustly call peaks, we first merge the sparse single-cell data into pseudo-bulk replicates by aggregating the insertions from many individual cells into a single group. We make multiple pseudo-bulk replicates for each cluster to enable an assessment of peak reproducibility. This process of pseudo-bulk generation is described in detail in the pseudo-bulk generation vignette. We than call peaks using MACS2 v??? and perform our iterative overlap removal. Once we obtain a finalized peak set, we collect insertion counts in each peak for each single cell and associate this with the corresponding ArrowFile via the ArchRProject.

#Create Group Coverage Files that can be used for downstream analysis (~5-10 minutes)
proj <- addGroupCoverages(ArchRProj = proj, groupBy = "Clusters", threads = threads)

#Call Reproducible Peaks w/ Macs2 (~5-10 minutes)
proj <- addReproduciblePeakSet(ArchRProj = proj, groupBy = "Clusters", threads = threads)

#Add Peak Matrix
proj <- addPeakMatrix(ArchRProj = proj, threads = threads, force = TRUE)

Identifying Marker Peaks

Often times, we are interested to know which peaks are unique to an individual cluster or a small group of clusters. We can do this in an unsupervised fashion in ArchR:

#Identify Marker Peaks
markersPeaks <- markerFeatures(ArchRProj = proj, threads = threads, useMatrix = "PeakMatrix")
heatmapPeaks <- markerHeatmap(
  seMarker = markersPeaks, 
  cutOff = "FDR <= 0.01 & Log2FC >= 1"
)
plotPDF(heatmapPeaks, name = "Peak-Marker-Heatmap", width = 8, height = 12, ArchRProj = proj)

Using our tutorial data, your marker peak heatmap should look like this.

Performing Motif Enrichments

QQQQQQQ - I think the concept of TF deviations is still abstract to most people. Does ArchR do motif enrichment with hypergeometric test as well??

Using the reproducible peak set that we defined above, we can use ArchR to calculate TF deviations on a single-cell basis for transcription factors in the peaks identified in each cluster. We can then overlay these deviations on on UMAP embedding. This effectively infers differences in TF activity across all single cells and is very useful in identifying regulatory factors governing cell fate.

#Motif Search in PeakSet
proj <- addMotifAnnotations(ArchRProj = proj, name = "Motif")

#Add chromVAR Deviations (~20-25 min if using CisBP MotifSet)
proj <- addDeviationsMatrix(ArchRProj = proj, threads = threads)

#Identify Variable TFs
head(getVarDeviations(proj, plot = FALSE)$name, 25)

#To access motif need to specify deviations,z : motif_name
#Try getFeatures with MotifMatrix to see available names
getFeatures(proj, select = "PAX5", useMatrix = "MotifMatrix")

#Define the list of motifs to plot
motifs <- c("GATA1_383", "CEBPA_155", "EBF1_67","IRF4_632","TBX21_780","PAX5_709")

#Plot the UMAP Embedding with chromVAR Deviations Overlayed
plotList <- list()
plotList[[1]] <- plotEmbedding(ArchRProj = proj, colorBy = "colData", name = "Clusters")
plotList[[2]] <- plotEmbedding(ArchRProj = proj, colorBy = "MotifMatrix", name = paste0("z:",motifs))
plotPDF(plotList = plotList, name = "Plot-UMAP-TileLSI-MotifMatrix", width = 6, height = 6, ArchRProj = proj)

Using our tutorial data, your marker peak heatmap should look like this.

Performing TF Footprinting

Transcription factor footprinting can also be done in ArchR with a single command. We note that the footprints generated by the tutorial data are not as clean as would be desired but this is because of the small size of the tutorial dataset.

#Plot Motif Footprints from positions list
#Recommend doing a few motifs not entire motif set
seFoot <- plotFootprints(proj, positions = getPositions(proj)[motifs])

click here: link2 ### Additional Tips {#tips}